Abstract.¹ Authoring documents in MKM formats like OMDoc is a very tedious
task. After years of working on a semantically annotated corpus of sTeX
documents (GenCS), we identified a set of common, time-consuming subtasks,
which can be supported in an integrated authoring environment.
We have adapted the modular Eclipse IDE into sTeXIDE, an authoring solution
for enhancing productivity in contributing to sTeX-based corpora. sTeXIDE
supports context-aware command completion, module management, semantic macro
retrieval, and theory graph navigation.

1 Introduction

Before we can manage mathematical 'knowledge', i.e. reuse and restructure it,
adapt its presentation to new situations, semi-automatically prove conjectures,
search it for theorems applicable to a given problem, or conjecture
representation theorems, we have to convert informal knowledge into
machine-oriented representations. How exactly to support this formalization
process so that it becomes as effortless as possible is one of the main
unsolved problems of MKM. Currently most mathematical knowledge is available in
the form of LaTeX-encoded documents. To tap this reservoir we have developed
the sTeX [Koh08,sTe09] format, a variant of LaTeX that is geared towards
marking up the semantic structure underlying a mathematical document.

In recent years, we have used sTeX in two larger case studies. In the first
one, the second author has accumulated a large corpus of teaching materials,
comprising more than 2,000 slides, about 800 homework problems, and hundreds
of pages of course notes, all written in sTeX. The material covers a general
first-year introduction to computer science, graduate lectures on logics, and
research talks on mathematical knowledge management. The second case study
consists of a corpus of semi-formal documents developed in the course of a
verification and SIL3-certification of a software module for safety zone
computations [KKL10a,KKL10b]. In both cases we took advantage of the fact that
sTeX documents can be transformed into the XML-based OMDoc [Koh06] by the
LaTeXML system [Mil10]; see [KKL10a] and [DKL+10] for a discussion of the MKM
services this affords.

¹ The final publication of this paper is available at www.springerlink.com

These case studies have confirmed that writing sTeX is much less tedious
than writing OMDoc directly. Particularly useful was the possibility of using
the sTeX-generated PDF for proofreading the text part of documents.
Nevertheless serious usability problems remain. They come from three sources:
P1 installation of the (relatively heavyweight) transformation system (with
   dependencies on perl, libXML2, LaTeX, the sTeX packages),
P2 the fact that sTeX supports an object-oriented style of writing mathematics,
   and
P3 the size of the collections, which makes it difficult to find reusable
   components.
The documents in the first (educational) corpus were mainly authored directly
in sTeX via a text editor (emacs with a simple sTeX mode [Pes07]). This was
serviceable for the author, who had a good recollection of the names of the
semantic macros he had declared, but presented a very steep learning curve for
other authors (e.g. teaching assistants) joining the effort. The software
engineering case study was a post-mortem formalization of existing (informal)
LaTeX documents. Here, installation problems and refactoring existing LaTeX
markup into more semantic sTeX markup presented the main problems.

Similar authoring and source management problems are tackled by Integrated
Development Environments (IDEs) like Eclipse [Ecl08], which integrate support
for finding reusable functions, refactoring, documentation, build management,
and version control into a convenient editing environment. In many ways, sTeX
shares more properties with programming languages like Java than with
conventional document formats, in particular with respect to the three problem
sources mentioned above:

S1 both require a build step (compiling Java and formatting/transforming sTeX
   into PDF/OMDoc),
S2 both favor an object-oriented organization of materials, which makes it
   possible to
S3 build up large collections of re-usable components.

To take advantage of the solutions found for these problems by software
engineering, we have developed the sTeXIDE integrated authoring environment
for sTeX-based representations of mathematical knowledge. In the next section
we recap the parts of sTeX needed to understand the system. In Section 3 we
present the user interface of the sTeXIDE system, and in Section 4 we discuss
implementation issues. Section 5 concludes the paper and discusses future work.

2 sTeX: Object-Oriented LaTeX Markup

The main concept in sTeX is that of a "semantic macro", i.e. a TeX command
sequence S that represents a meaningful (mathematical) concept or object O:
the TeX formatter will expand S to the presentation of O. For instance, the
command sequence \positiveReals is a semantic macro that represents a
mathematical symbol: the set ℝ⁺ of positive real numbers. While the use of
semantic macros is generally considered a good markup practice for scientific
documents,² regular TeX/LaTeX does not offer any infrastructural support for
this. sTeX does just this by adopting a semantic, "object-oriented" approach to
semantic macros by grouping them into "modules", which are linked by an
"imports" relation. To get a better intuition, consider the example in
Listing 1.1.

² For example, because they allow adapting notation by macro redefinition and
thus increase reusability.

Listing 1.1. An sTeX module for Real Numbers

    \begin{module}[id=reals]
      \importmodule[../background/sets]{sets}
      \symdef{Reals}{\mathcal{R}}
      \symdef{greater}[2]{#1>#2}
      \symdef{positiveReals}{\Reals^+}
      \begin{definition}[id=posreals.def,title=Positive Real Numbers]
        The set $\positiveReals$ is the set of $\inset{x}\Reals$ such that $\greater{x}0$
      \end{definition}
    \end{module}

which would be formatted to

    Definition 2.1 (Positive Real Numbers):
    The set R⁺ is the set of x ∈ R such that x > 0

Note that the markup in the module reals has access to the semantic macro
\inset (membership) from the module sets, which was imported via the
\importmodule directive from ../background/sets.tex. Furthermore, it has
access to \defeq (definitional equality), which was in turn imported by the
module sets.

From this example we can already see an organizational advantage of sTeX
over LaTeX: we can define the (semantic) macros close to where the
corresponding concepts are defined, and we can (recursively) import
mathematical modules. But the main advantage of markup in sTeX is that it can
be transformed to XML via the LaTeXML system [Mil10]: Listing 1.2 shows the
OMDoc [Koh06] representation generated from the sTeX sources in Listing 1.1.

Listing 1.2. An XML Version of Listing 1.1

    <theory xml:id="reals">
      <imports from="../background/sets.omdoc#sets"/>
      <symbol xml:id="Reals"/>
      <notation>
        <prototype><OMS cd="reals" name="Reals"/></prototype>
        <rendering><m:mo>ℝ</m:mo></rendering>
      </notation>
      <symbol xml:id="greater"/><notation>...</notation>
      <symbol xml:id="positiveReals"/><notation>...</notation>
      <definition xml:id="posreals.def" for="positiveReals">
        <meta property="dc:title">Positive Real Numbers</meta>
        The set <OMOBJ><OMS cd="reals" name="positiveReals"/></OMOBJ> is the set ...
      </definition>
    </theory>

One thing that stands out from the XML in this listing is that it incorporates
all the information from the sTeX markup that was invisible in the PDF produced
by formatting it with TeX.

3 User interface features of sTeXIDE

One of the main priorities we set for sTeXIDE is to have a relatively gentle
learning curve. As the first experience of using a program is running the
installation process, we worked hard to make this step as automated and
platform independent as possible. We aim at supporting popular operating
systems such as Windows and Unix-based platforms (Ubuntu, SuSE). Creating an
OS-independent distribution of Eclipse with our plugin preinstalled was a
relatively straightforward task; so was distributing the plugin through an
update site. What was challenging was getting the 3rd party software
(pdflatex, svn, latexml, perl) and hence OS-specific ports installed correctly.

After installation we provide a new project wizard for sTeX projects which
lets the user choose the output format (.dvi, .pdf, .ps, .omdoc, .xhtml) as
well as one of the predefined sequences of programs to be executed for the
build process. This will control the Eclipse-like workflow, where the chosen
'outputs' are rebuilt after every save, and syntactic (as well as semantic)
error messages are parsed, cross-referenced, and displayed to the user in a
collapsible window. The wizard then creates a stub project, i.e. a file
main.tex which has the structure of a typical sTeX file but also includes the
stex package and imports a sample module defined in sample_mod.tex.

Fig. 1. Context aware autocompletion feature for semantic macros

sTeXIDE supports the user in creating, editing and maintaining sTeX documents
or corpora. For novice users we provide templates for creating modules,
imports and definitions. Later on, the user benefits from context-aware
autocompletion, which assists the user in using valid LaTeX and sTeX macros.
Here, by valid macros, we mean macros which were previously defined or imported
(directly or indirectly) from other modules. Consider the sample sTeX source
in Listing 1.1. At the end of the first line, one would only be able to
autocomplete LaTeX macros, whereas at the end of the second line, one would
already have macros like \inset from the imported sets module (see Fig. 1).
Note that we also make use of the semantic structure of the sTeX document in
Listing 1.1 for explanations. Namely, the macro \positiveReals is linked to its
definition via the key for=positiveReals; this makes it possible to display the
text of the definition as part of the macro autocompletion explanation (the
yellow box).
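
The visibility rule behind context-aware completion can be illustrated with a
minimal Python sketch (not the sTeXIDE implementation, which is an Eclipse
plugin). It assumes a hypothetical in-memory table MODULES that maps module
names to their imports and \symdef-defined macros; the module name "logic" as
the origin of \defeq is an assumption, since the text does not name the module
that sets imports it from. The sketch treats visibility per module and ignores
the position of the cursor relative to the \importmodule line.

```python
# Hypothetical module table: name -> imports and macros defined via \symdef.
# "logic" is an assumed name for the module that provides \defeq.
MODULES = {
    "logic": {"imports": [], "macros": ["defeq"]},
    "sets": {"imports": ["logic"], "macros": ["inset"]},
    "reals": {"imports": ["sets"],
              "macros": ["Reals", "greater", "positiveReals"]},
}


def visible_macros(module, modules=MODULES):
    """Collect macros defined in `module` or reachable through its imports."""
    seen, stack, macros = set(), [module], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against import cycles
        seen.add(current)
        macros.update(modules[current]["macros"])
        stack.extend(modules[current]["imports"])
    return macros


def complete(prefix, module, modules=MODULES):
    """Return the visible macro names that start with `prefix`, sorted."""
    return sorted(name for name in visible_macros(module, modules)
                  if name.startswith(prefix))
```

In this sketch, complete("in", "reals") offers inset because reals imports
sets, while complete("pos", "sets") offers nothing because sets does not import
reals.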

Similarly, semantic macro retrieval (triggered by typing '\*') will suggest
all available macros from all modules of the current project. If the
auto-completed macro is not valid for the current context, sTeXIDE will insert
the required import statement so that the macro becomes valid.

Moreover, sTeXIDE supports several typical document/collection maintenance
tasks. Supporting symbol and module name refactoring is very important, as
doing it manually is both extremely error-prone and time-consuming, especially
if two different modules define a symbol with the same name and only one of
them is to be renamed. The module splitting feature makes it easier for users
to split a larger module into several semantically self-contained modules which
are easier to reuse. This feature ensures that imports required to make the
newly created module valid are automatically inserted.

Finally, import minimization creates warnings for unused or redundant
\importmodule declarations and suggests removing them. Consider for instance
the situation in the accompanying diagram, where modules C and B import
module A. Now, if we add a semantic macro in C that needs an import from B,
then we should replace the import of A in C with one of B instead of just
adding the latter (i.e. we would replace the dashed by the dotted import).

[Diagram: modules A, B, and C; B imports A, C imports A (dashed), C imports B (dotted)]
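
The redundancy check in this situation can be sketched as follows. This is an
illustration under assumptions, not the sTeXIDE code: the import graph is a
plain dictionary from a module name to the list of modules it imports
directly, the graph is assumed to be acyclic, and the function names are
hypothetical. Detection of unused imports is not covered.

```python
# Import graph for the situation described above, after C has gained an
# import of B: B imports A, and C imports both A and B.
EXAMPLE_GRAPH = {"A": [], "B": ["A"], "C": ["A", "B"]}


def reachable(graph, start):
    """All modules reachable from `start` via one or more import edges."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def redundant_imports(graph, module):
    """Direct imports of `module` that another direct import already provides."""
    direct = graph.get(module, ())
    return sorted(
        target for target in direct
        if any(target in reachable(graph, other)
               for other in direct if other != target)
    )
```

For EXAMPLE_GRAPH, redundant_imports(EXAMPLE_GRAPH, "C") reports the import of
A, since A is already available through B.
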
Fig. 2. Macro Retrieval via Mathematical Concepts

Three additional features make navigation and information retrieval in big
corpora easier. The outline view of the document (right side of Figure 1)
displays the main semantic structures inside the current document. One can use
the outline tree layout to copy, cut and navigate to areas represented by the
respective structures. In the case of imports one can navigate to imported
modules. Theory graph navigation is another feature of sTeXIDE. It creates a
graphical representation of how modules are related through imports. This gives
the author a chance to get a better intuition for how concepts and modules are
related. The last feature is semantic macro search. Its aim is to search for
semantic macros by their mathematical descriptions, which can be entered into
the search box in Figure 2. The feature then searches definitions, assumptions
and theorems for the query terms and reports any \symdef-defined semantic
macros 'near' the hits. This has proved very convenient in situations where the
macro names are abbreviated (e.g. \sconcjuxt for "string concatenation by
juxtaposition") or where there is more than one name for a mathematical concept
(e.g. "concatenation" for \sconcjuxt) and the author wants to re-use semantic
macros defined by someone else.
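
A rough sketch of such a description-based search is given below. The text
does not specify what 'near' means, so the sketch makes the simplifying
assumption that a macro is near a hit if it is defined in the same module as
the matching statement. The data layout, the module name "strings", and the
function name are all hypothetical.

```python
import re

# Hypothetical statement table: (module, statement text, macros of that module).
STATEMENTS = [
    ("strings", "The concatenation of two strings by juxtaposition",
     ["sconcjuxt"]),
    ("reals", "The set of positive real numbers", ["positiveReals"]),
]


def search_macros(query, statements=STATEMENTS):
    """Return (module, macro) pairs whose statement contains all query terms."""
    terms = [term.lower() for term in query.split()]
    if not terms:
        return []
    hits = []
    for module, text, macros in statements:
        words = set(re.findall(r"[a-z]+", text.lower()))
        if all(term in words for term in terms):
            hits.extend((module, macro) for macro in macros)
    return hits
```

Searching for "concatenation" then reports sconcjuxt even though the query
shares no substring with the abbreviated macro name.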



4 Implementation 

The implementation of sTeXIDE is based on the TeXlipse [TeX08] plugin for
Eclipse. This plugin makes use of Eclipse's modular framework (see Fig. 3)
and provides features like syntax highlighting, code folding, outline
generation, autocompletion and templating mechanisms. Unfortunately, TeXlipse
uses a parser which is hardwired for a fixed set of LaTeX macros like \section,
\input, etc., which made it quite challenging to generalize it to sTeX-specific
macros. Therefore we had to reimplement parts of TeXlipse so that sTeX macros
like \symdef and \importmodule that extend the set of available macros can be
treated specially. We have underlined all the parts of TeXlipse we had to
extend or replace in Figure 3.

Fig. 3. Component architecture of TeXlipse (adapted from [TeX10])

To support context-sensitive autocompletion and refactoring we need to know
the exact position in the source code where modules and symbols are defined.
Running a fully featured LaTeX parser like LaTeXML proved to be too slow
(sometimes taking 5-10 sec to compile a document of 15 pages) and sensitive
to errors. For these reasons, we implemented a very fast but naive LaTeX parser
which analyses the source code and identifies commands, their arguments and
options. We call this parser naive because it parses only one file at a time
(i.e. inclusions and styles are not processed) and macros are not expanded. We
realize the parse tree as an in-memory XML DOM to achieve format independence
(see below). Then we run a set of semantic spotters which identify constructs
like module and import declarations, inclusions, as well as
sections/subsections etc., resulting in an index of relevant structural parts
of the sTeX source identified by unique URIs and line/column number ranges in
the source. For example, a module definition in sTeX begins with
\begin{module}[id=module_id] and ends with \end{module}, so the structure
identifying a module will contain these two ranges.
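
To make the idea of a spotter concrete, the following Python sketch locates
module and import declarations in a single source file without expanding
macros. It deliberately swaps in regular expressions for the JavaCC-generated
parser and XML DOM described in the text, and it records only line numbers
rather than line/column ranges; the function name and the result layout are
assumptions.

```python
import re

BEGIN = re.compile(r"\\begin\{module\}\[id=([^\],]+)")
END = re.compile(r"\\end\{module\}")
IMPORT = re.compile(r"\\importmodule(?:\[[^\]]*\])?\{([^}]+)\}")

SAMPLE = r"""\begin{module}[id=reals]
\importmodule[../background/sets]{sets}
\symdef{Reals}{\mathcal{R}}
\end{module}"""


def spot_modules(source):
    """Map module ids to their begin/end lines (1-based) and imported modules."""
    index, open_ids = {}, []
    for lineno, line in enumerate(source.splitlines(), start=1):
        begin = BEGIN.search(line)
        if begin:
            index[begin.group(1)] = {"begin": lineno, "end": None, "imports": []}
            open_ids.append(begin.group(1))
        for imp in IMPORT.finditer(line):
            if open_ids:  # attribute the import to the innermost open module
                index[open_ids[-1]]["imports"].append(imp.group(1))
        if END.search(line) and open_ids:
            index[open_ids.pop()]["end"] = lineno
    return index
```

A module whose \end{module} has not been typed yet simply keeps "end" set to
None, which mirrors the error tolerance that autocompletion needs while a
document is being edited.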

Note that the LaTeX document model (and thus that of sTeX) is a tree, so two
spotted structure domains are either disjoint or one contains the other. We
therefore store them in a range tree, which we use for efficient change
management: sTeXIDE implements a class which listens to changes made in
documents and checks whether they intersect with the important ranges of the
spotted structures or introduce new commands (i.e. start with '\'). If neither
is the case, the range tree is merely updated by calculating new line and
column numbers. Otherwise we run the naive LaTeX parser and the spotters again.
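
The decision between shifting ranges and reparsing can be sketched as below.
This is a simplification under stated assumptions: a flat dictionary stands in
for the range tree, each structure is represented only by the line numbers of
its opening and closing delimiters, and an edit is described by the line it
occurs on, the number of lines it adds (negative for deletions), and the
inserted text. All names are hypothetical.

```python
def apply_edit(ranges, edit_line, lines_added, inserted_text):
    """Return (ranges, needs_reparse) after an edit.

    ranges maps a structure name to (begin_line, end_line), the lines holding
    its opening and closing delimiters.
    """
    touches_delimiter = any(edit_line in (begin, end)
                            for begin, end in ranges.values())
    if touches_delimiter or "\\" in inserted_text:
        # Structure may have changed: the caller should rerun parser and spotters.
        return ranges, True
    shifted = {
        name: (begin + lines_added if begin > edit_line else begin,
               end + lines_added if end > edit_line else end)
        for name, (begin, end) in ranges.items()
    }
    return shifted, False
```

Typing two lines of plain prose inside a module therefore only moves the
closing delimiter down by two lines, whereas typing a new command triggers a
reparse.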

Our parser is entirely generated by a JavaCC grammar. It supports error
recovery (essential for autocompletion) and does not need to be changed if a
new macro needs to be handled: semantic spotters can be implemented as
XQueries, and our parser architecture provides an API for adding custom-made
semantic spotters. This makes the parser extensible to new sTeX features and
allows working around the limitation of the naive LaTeX parser of not expanding
macros.

We implemented several indexes to support the features mentioned in Section 3.
For theory navigation we have an index called TheoryIndex which manages a
directed graph of modules and import relationships among them. It allows
a) retrieving a list of modules which import/are imported by module X and
b) checking if module X is directly/indirectly imported by module Y.
SymdefIndex is another index which stores pairs of module URIs and symbols
defined in those modules. It supports fast retrieval of (symbol, module) pairs
where a symbol name starts with a certain prefix by using a trie data
structure. As expected, this index is used for both the context-aware
autocompletion and the semantic macro retrieval features. The difference is
that context-aware autocompletion also filters out the modules not accessible
from the current module by using the TheoryIndex. Refactoring makes use of an
index called RefIndex. This index stores (module URI, definition module URI,
symbol name) triples for all symbol occurrences (not just definitions as in
SymdefIndex). Hence when the author wants to rename a certain symbol, we first
identify where that symbol is defined (i.e. its definition module URI) and then
query for all other occurrences with the same name and definition module URI.
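
The two lookups described above can be sketched in a few lines of Python. The
class and function below borrow the names SymdefIndex and RefIndex from the
text, but their methods, the nested-dictionary trie, and the example module
names are assumptions made for illustration and do not reproduce the plugin's
Java code.

```python
class SymdefIndex:
    """Prefix index over (symbol, module) pairs, stored in a character trie."""

    PAIRS = "$pairs"  # multi-character key, cannot clash with a single character

    def __init__(self):
        self.root = {}

    def add(self, symbol, module):
        node = self.root
        for ch in symbol:
            node = node.setdefault(ch, {})
        node.setdefault(self.PAIRS, []).append((symbol, module))

    def with_prefix(self, prefix, accessible=None):
        """Pairs whose symbol starts with `prefix`; optionally keep only
        modules in `accessible` (as a TheoryIndex-style filter would)."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            for key, value in current.items():
                if key == self.PAIRS:
                    found.extend(value)
                else:
                    stack.append(value)
        if accessible is not None:
            found = [(s, m) for s, m in found if m in accessible]
        return sorted(found)


def occurrences_to_rename(ref_index, symbol, definition_module):
    """Select (module, definition module, symbol) triples for one definition."""
    return [triple for triple in ref_index
            if triple[1] == definition_module and triple[2] == symbol]
```

Passing accessible=None corresponds to project-wide semantic macro retrieval,
while passing the set of modules reachable from the current one corresponds to
context-aware completion. Because occurrences_to_rename matches on the
definition module as well as the name, a symbol with the same name defined in
a different module is left untouched.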

5 Conclusion and Future Work

We have presented the sTeXIDE system, an integrated authoring environment
for sTeX collections realized as a plugin to the Eclipse IDE. Even though the
implementation is still in a relatively early state, this experiment confirmed
the initial expectation that the installation, navigation, and build support
features contributed by Eclipse can be adapted to a useful authoring
environment for sTeX with relatively little effort. The modularity framework of
Eclipse and the TeXlipse plugin for LaTeX editing have been beneficial for our
development. However, we were rather surprised to see that a large part of the
support infrastructure we would have expected to be realized in the framework
was in fact hard-coded into the plugins. This has resulted in unnecessary
re-implementation work.

In particular, system- and collection-level features of sTeXIDE like automated
installation, PDF/XML build support, context-sensitive completion of command
sequences, import minimization, navigation, and concept-based search have
proven useful, and are not offered by document-oriented editing solutions.
Indeed such features are very important for editing and maintaining any MKM
representations. Therefore we plan to extend sTeXIDE to a general "MKM IDE",
which supports more MKM formats and their human-oriented front-end syntaxes
(just like sTeX serves as a front-end to OMDoc in sTeXIDE).

The modular structure of Eclipse also allows us to integrate MKM services 
(e.g. information retrieval from the background collection or integration of ex- 
ternal proof engines for formal parts [ALWF06]; see [KRZ10] for others) into 
this envisioned "MKM IDE", so that it becomes a "rich collection client" to a 
universal digital mathematics library (UDML), which would continuously grow 
and in time would contain essentially all mathematical knowledge, envisioned as
the Grand Challenge for MKM in [Far05].

In the implementation effort we tried to abstract from the sTeX surface
syntax, so we anticipate that we will be able to directly re-use our spotters
or adapt them for other surface formats that share the OMDoc data model. The
next target in this direction is the modular LF format introduced in [RS09].
This can be converted to OMDoc by the TWELF system, which makes its treatment
directly analogous to that of sTeX. This would provide a way of sharing
information among different authoring systems and workflows.